Apache Mahout Cookbook by Unknown

Author:Unknown
Format: epub
Tags: Computers, Databases, Enterprise Applications, Business Intelligence Tools, Data Mining
Published: 0101-01-01T00:00:00+00:00


- uciCarTrain.tsv (the source file)

- rating ~ buying + maintenance + doors + persons + lug_boot + safety (the formula used to train the logistic regression)

- model.ser (the name of the model file)

Chapter 5

In short, we try to predict the rating on the assumption that it depends on the columns buying, maintenance, doors, persons, lug_boot, and safety. The full set of columns is buying, maintenance, doors, persons, lug_boot, safety, and rating. If we look at the source code in the GitHub repository, we can see that the LogisticTrain class is written to follow the Hadoop conventions for implementing a MapReduce job (see https://github.com/WinVector/Logistic/blob/master/src/com/winvector/logistic/LogisticTrain.java).
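The training formula can be read mechanically: the name to the left of ~ is the column to predict, and the names joined by + on the right are the predictors. The following is a minimal sketch of that reading only. The class FormulaSketch and the method parseFormula are hypothetical names introduced for illustration; this is not the parser used by the Logistic project.

```java
import java.util.ArrayList;
import java.util.List;

class FormulaSketch {
    /**
     * Splits "response ~ a + b + c" into a list whose first element is the
     * response and whose remaining elements are the predictors.
     * Hypothetical helper, not part of the Logistic library.
     */
    public static List<String> parseFormula(String formula) {
        String[] sides = formula.split("~");
        if (sides.length != 2) {
            throw new IllegalArgumentException("Expected exactly one '~' in: " + formula);
        }
        List<String> names = new ArrayList<>();
        names.add(sides[0].trim());
        for (String term : sides[1].split("\\+")) {
            names.add(term.trim());
        }
        return names;
    }

    public static void main(String[] args) {
        List<String> names = parseFormula(
            "rating ~ buying + maintenance + doors + persons + lug_boot + safety");
        System.out.println("response   = " + names.get(0));
        System.out.println("predictors = " + names.subList(1, names.size()));
    }
}
```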

The MapReduce job reads a portion of the data file, transforms it into sequence files, and then evaluates every portion of the data to obtain a set of scores, which are compared to find the best one. Next, the reducer aggregates all of the data into the model.ser file.
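One reason this kind of training fits the map and reduce pattern is that the score of a logistic model over the whole dataset is a sum of scores over its portions. The sketch below illustrates that additive property in plain Java, without Hadoop. It makes several assumptions: the score is the log-likelihood, the target is binary (the car rating actually has more than two levels), and the names PartitionedLogLikelihood, partialLogLikelihood, and reduce are hypothetical. It is not the code from the Logistic repository.

```java
import java.util.Arrays;
import java.util.List;

class PartitionedLogLikelihood {
    /** "Map" side: log-likelihood of one portion of rows under the coefficients beta. */
    public static double partialLogLikelihood(double[][] x, int[] y, double[] beta) {
        double sum = 0.0;
        for (int i = 0; i < x.length; i++) {
            double z = 0.0;
            for (int j = 0; j < beta.length; j++) {
                z += beta[j] * x[i][j];
            }
            double p = 1.0 / (1.0 + Math.exp(-z)); // predicted probability of class 1
            sum += (y[i] == 1) ? Math.log(p) : Math.log(1.0 - p);
        }
        return sum;
    }

    /** "Reduce" side: the partial sums add up to the score over the full dataset. */
    public static double reduce(List<Double> partials) {
        double total = 0.0;
        for (double v : partials) {
            total += v;
        }
        return total;
    }

    public static void main(String[] args) {
        double[] beta = {0.0, 0.0};                   // placeholder coefficients
        double[][] part1 = {{1.0, 2.0}, {1.0, -1.0}}; // made-up rows, first portion
        int[] y1 = {1, 0};
        double[][] part2 = {{1.0, 0.5}};              // made-up rows, second portion
        int[] y2 = {1};
        List<Double> partials = Arrays.asList(
            partialLogLikelihood(part1, y1, beta),
            partialLogLikelihood(part2, y2, beta));
        System.out.println("total log-likelihood = " + reduce(partials));
    }
}
```

Because each portion is scored independently, the portions can be processed by separate mappers, and only the small partial results need to travel to the reducer.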

Note that this implementation was designed to scale to Hadoop, so the job can be changed to define its own mapper and reducers.

See also

The whole implementation has been coded by John Mount, who also maintains a blog about data mining and the implementation of the logistic regression using Hadoop. We strongly suggest that you follow his blog, which is available at http://www.win-vector.com/blog/.

Stock Market Forecasting with Mahout

Using Random Forest to forecast market movements

This recipe will act on the same dataset we have used so far. The difference is that we are going to use a different type of algorithm called Random Forest.

This kind of algorithm is very efficient at classification when there is a large set of predictors. We will use it to forecast Google's market movement, but a potentially better application would be to forecast the movement of the NASDAQ based on all of the stocks that compose the NASDAQ basket. So, if we have five attributes for every stock (Open, Close, High, Low, and Volume), we could have hundreds of numerical predictors to forecast a single movement.
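To make the predictor count concrete, the sketch below concatenates the five attributes of each stock into one row of predictors, so that 100 stocks yield 500 numerical predictors for a single observation. The class PredictorRow, the method flatten, and all values are hypothetical and used only for illustration; the recipe itself works on a single Google dataset.

```java
class PredictorRow {
    static final String[] ATTRIBUTES = {"Open", "Close", "High", "Low", "Volume"};

    /** Concatenates the per-stock attribute arrays into one flat row of predictors. */
    public static double[] flatten(double[][] perStock) {
        double[] row = new double[perStock.length * ATTRIBUTES.length];
        for (int s = 0; s < perStock.length; s++) {
            if (perStock[s].length != ATTRIBUTES.length) {
                throw new IllegalArgumentException(
                    "Stock " + s + " must have " + ATTRIBUTES.length + " attributes");
            }
            System.arraycopy(perStock[s], 0, row, s * ATTRIBUTES.length, ATTRIBUTES.length);
        }
        return row;
    }

    public static void main(String[] args) {
        int numberOfStocks = 100;                         // assumed basket size
        double[][] perStock = new double[numberOfStocks][ATTRIBUTES.length];
        for (int s = 0; s < numberOfStocks; s++) {
            perStock[s] = new double[] {10.0 + s, 10.5 + s, 11.0 + s, 9.5 + s, 1000.0 * (s + 1)};
        }
        double[] row = flatten(perStock);                 // placeholder values, not market data
        System.out.println("predictors per observation = " + row.length);
    }
}
```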

This recipe has been built out of the information provided by Jennifer Smith on GitHub.

Getting ready

To get started with this example, create a new Maven project with the following command:

mvn archetype:create -DarchetypeGroupId=org.apache.maven.archetypes -DgroupId=com.packtpub.mahoutcookbook -DartifactId=chapter05b

We also need to add references to the local Mahout Maven project.

How to do it…

The basic steps for this algorithm are as follows:

1. Read a CSV file and convert it into an array of strings.

2. Divide the resulting array into training and testing data using a 90/10 split, meaning that the training set is 90 percent of the whole dataset.

3. Run the runIteration method with a defined number of trees to create a descriptor object.
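The first two steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the recipe's own code: the class TrainTestSplit and the methods readLines and split are hypothetical names, the rows are kept in file order without shuffling, and the sample rows are made up.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

class TrainTestSplit {
    /** Step 1: read every line of a CSV file into an array of strings. */
    public static String[] readLines(Path path) throws IOException {
        List<String> lines = Files.readAllLines(path);
        return lines.toArray(new String[0]);
    }

    /** Step 2: the first trainPercent of the rows go to training, the rest to testing. */
    public static String[][] split(String[] rows, int trainPercent) {
        if (trainPercent < 0 || trainPercent > 100) {
            throw new IllegalArgumentException("trainPercent must be between 0 and 100");
        }
        int cut = rows.length * trainPercent / 100;
        String[] train = Arrays.copyOfRange(rows, 0, cut);
        String[] test = Arrays.copyOfRange(rows, cut, rows.length);
        return new String[][] {train, test};
    }

    public static void main(String[] args) throws IOException {
        // Write ten made-up rows to a temporary file so the sketch runs on its own.
        Path tmp = Files.createTempFile("sample", ".csv");
        String[] sample = new String[10];
        for (int i = 0; i < sample.length; i++) {
            sample[i] = i + ",1.0,2.0";
        }
        Files.write(tmp, Arrays.asList(sample));

        String[][] parts = split(readLines(tmp), 90);
        System.out.println("training rows = " + parts[0].length);
        System.out.println("testing rows  = " + parts[1].length);
        Files.delete(tmp);
    }
}
```

Keeping the rows in file order means the test set holds the last rows of the file, which may or may not match how the recipe selects its test data.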

To run this example, we need to add the following code to the main method of the App.java class automatically generated by Maven:

String trainingSetFile = "/mnt/new/logistic/train/google.csv";
int numberOfTrees = 100;
boolean useThresholding = true;

System.out.println("Building " + numberOfTrees + " trees.");
String[] trainDataValues = fileAsStringArray(trainingSetFile, 42000, useThresholding);

String[] testDataValues = new String[]{};

String descriptor = buildDescriptor(trainDataValues[0].


